image and video
Elon Musk's Alternate Grok Reality
Amid a scandal over nonconsensual sexual images, Musk says his AI chatbot is a force for "truth and beauty." In much of the world, Grok and its parent company both appear to be in serious trouble. After Grok, X's AI chatbot, was used to generate sexualized and violent images of women and children, the social media company has faced a wave of backlash and censure, with new nationwide bans on accessing Grok in place and other consequences on the way. On Monday, the EU threatened to fine X under its broad Digital Services Act if it didn't act "quickly" to fix Grok, in the words of one regulator.
- Europe > United Kingdom (0.16)
- Asia > Malaysia (0.15)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.05)
- Asia > Indonesia (0.05)
- Media (1.00)
- Law (1.00)
- Government > Regional Government (0.71)
- Information Technology > Services (0.69)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.97)
Grok Is Generating Sexual Content Far More Graphic Than What's on X
A WIRED review of outputs hosted on Grok's official website shows it's being used to create violent sexual images and videos, as well as content that includes apparent minors. Elon Musk's Grok chatbot has drawn outrage and calls for investigation after being used to flood X with "undressed" images of women and sexualized images of what appear to be minors. However, that's not the only way people have been using the AI to generate sexualized images. Grok's website and app, which are separate from X, include sophisticated video generation that is not available on X and is being used to produce extremely graphic, sometimes violent, sexual imagery of adults that is vastly more explicit than images created by Grok on X. It may also have been used to create sexualized videos of apparent minors.
- North America > United States > California (0.14)
- Europe > United Kingdom > Wales (0.04)
- Europe > Slovakia (0.04)
- (2 more...)
- Media (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- (2 more...)
Elon Musk's Pornography Machine
On X, sexual harassment and perhaps even child abuse are the latest memes. Earlier this week, some people on X began replying to photos with a very specific kind of request. "Put her in a bikini," "take her dress off," "spread her legs," and so on, they commanded Grok, the platform's built-in chatbot. Again and again, the bot complied, using photos of real people (celebrities and noncelebrities, including some who appear to be young children) and putting them in bikinis, revealing underwear, or sexual poses. By one estimate, Grok generated one nonconsensual sexual image every minute in a roughly 24-hour stretch.
- Law (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.89)
- Health & Medicine > Therapeutic Area > Pediatrics/Neonatology (0.35)
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Recent action recognition models have achieved impressive results by integrating objects, their locations and interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how could we leverage these for a video downstream task? We propose a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model. SViT relies on two key insights.
S4ND: Modeling Images and Videos as Multidimensional Signals with State Spaces
Visual data such as images and videos are typically modeled as discretizations of inherently continuous, multidimensional signals. Existing continuous-signal models attempt to exploit this fact by modeling the underlying signals of visual (e.g., image) data directly. However, these models have not yet been able to achieve competitive performance on practical vision tasks such as large-scale image and video classification. Building on a recent line of work on deep state space models (SSMs), we propose S4ND, a new multidimensional SSM layer that extends the continuous-signal modeling ability of SSMs to multidimensional data including images and videos. We show that S4ND can model large-scale visual data in 1D, 2D, and 3D as continuous multidimensional signals and achieves strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models.
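The deep state space models this abstract builds on all start from the same discrete recurrence. A minimal pure-Python sketch of that building block, with made-up scalar parameters (illustrative only; S4ND itself learns structured, multidimensional versions of these matrices):

```python
# Minimal discrete state space model (SSM) recurrence:
#   x_k = a * x_{k-1} + b * u_k    (state update)
#   y_k = c * x_k                  (readout)
# Scalar a, b, c are hypothetical stand-ins for the learned SSM parameters.
def ssm_scan(u, a=0.9, b=1.0, c=1.0):
    x, ys = 0.0, []
    for u_k in u:
        x = a * x + b * u_k  # fold the input into the hidden state
        ys.append(c * x)     # emit the observed output
    return ys

# An impulse input reveals the layer's impulse response: a geometric
# decay governed by a, which is what lets an SSM model a continuous
# signal at any sampling rate.
print(ssm_scan([1.0, 0.0, 0.0]))
```

A multidimensional layer like the one described would apply such kernels independently along each axis (height, width, time) rather than along a single sequence.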
Composing Concepts from Images and Videos via Concept-prompt Binding
Kong, Xianghao, Zhang, Zeyu, Guo, Yuwei, Zhao, Zhuoran, Zhang, Songchun, Rao, Anyi
Visual concept composition aims to integrate different elements from images and videos into a single, coherent visual output, yet existing methods still fall short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
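The cross-attention conditioning the binder hooks into is the standard Diffusion Transformer form (shown here as the generic formulation, not anything specific to Bind & Compose):

```latex
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

where $Q$ is projected from the latent image/video tokens and $K$, $V$ from the prompt tokens, so a concept bound into a prompt token steers generation through the $K$, $V$ pathway.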
- Europe > Switzerland (0.05)
- North America > United States (0.04)
- Asia (0.04)
How to glimpse a pre-AI internet
Slop Evader isn't meant as a solution, but it gives a temporary reprieve. A sizable portion of the internet has devolved into an AI-contaminated wasteland. While an easy solution remains elusive, a browser extension called Slop Evader offers a glimpse at what the internet used to be only a few short years ago. While always prone to innumerable hazards, the online ecosystem is degrading largely due to the misuse of generative artificial intelligence content.
- North America > United States > California (0.05)
- Asia > Middle East > UAE > Dubai Emirate > Dubai (0.05)
- Asia > Japan (0.05)
RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
Mei, Haiyang, Huang, Qiming, Ci, Hai, Shou, Mike Zheng
Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise visual servoing for VLA systems, scalable robot-centric data augmentation, accurate real-to-sim transfer, and reliable safety monitoring in dynamic human-robot environments. Despite the strong capabilities of modern segmentation models, it remains surprisingly challenging to segment robots, owing to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.
- Asia > Singapore (0.40)
- Asia > Middle East > Jordan (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)